{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Nets for Digit Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### by Khaled Nasr as a part of a <a href=\"https://www.google-melange.com/gsoc/project/details/google/gsoc2014/khalednasr92/5657382461898752\">GSoC 2014 project</a> mentored by Theofanis Karaletsos and Sergey Lisitsyn "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook illustrates how to use the NeuralNets module to teach a [neural network](http://en.wikipedia.org/wiki/Artificial_neural_network) to recognize digits. It also explores the different optimization and regularization methods supported by the module. [Convolutional neural networks](http://en.wikipedia.org/wiki/Convolutional_neural_network) are also discussed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An Artificial Neural Network is a machine learning model that is inspired by the way biological nervous systems, such as the brain, process information. The building block of neural networks is called a neuron. All a neuron does is take a weighted sum of its inputs and pass it through some non-linear function (activation function) to produce its output. A (feed-forward) neural network is a bunch of neurons arranged in layers, where each neuron in layer *i* takes its input from all the neurons in layer *i-1*. For more information on how neural networks work, [follow this link](https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH).\n",
    "\n",
    "In this notebook, we'll look at how a neural network can be used to recognize digits. We'll train the network on the USPS dataset of handwritten digits.\n",
    "\n",
    "We'll start by loading the data and dividing it into a training set, a validation set, and a test set. The USPS dataset has 9298 examples of handwritten digits. We'll intentionally use just a small portion (1000 examples) of the dataset for training . This is to keep training time small and to illustrate the effects of different regularization methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%pylab inline\n",
    "%matplotlib inline\n",
    "from scipy.io import loadmat\n",
    "from modshogun import RealFeatures, MulticlassLabels, Math\n",
    "\n",
    "# load the dataset\n",
    "dataset = loadmat('../../../data/multiclass/usps.mat')\n",
    "\n",
    "Xall = dataset['data']\n",
    "# the usps dataset has the digits labeled from 1 to 10 \n",
    "# we'll subtract 1 to make them in the 0-9 range instead\n",
    "Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1 \n",
    "\n",
    "# 1000 examples for training\n",
    "Xtrain = RealFeatures(Xall[:,0:1000])\n",
    "Ytrain = MulticlassLabels(Yall[0:1000])\n",
    "\n",
    "# 4000 examples for validation\n",
    "Xval = RealFeatures(Xall[:,1001:5001])\n",
    "Yval = MulticlassLabels(Yall[1001:5001])\n",
    "\n",
    "# the rest for testing\n",
    "Xtest = RealFeatures(Xall[:,5002:-1])\n",
    "Ytest = MulticlassLabels(Yall[5002:-1])\n",
    "\n",
    "# initialize the random number generator with a fixed seed, for repeatability\n",
    "Math.init_random(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the network"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create a neural network in shogun, we'll first create an instance of [NeuralNetwork](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html) and then [initialize](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html#a8ff6d177c3e2d8977e5fc6920d3e1579) it by telling it how many inputs it has and what type of layers it contains. To specifiy the layers of the network a [DynamicObjectArray](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDynamicObjectArray.html) is used. The array contains instances of [NeuralLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralLayer.html)-based classes that determine the type of neurons each layer consists of. Some of the supported layer types are: [NeuralLinearLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralLinearLayer.html), [NeuralLogisticLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralLogisticLayer.html) and\n",
    "[NeuralSoftmaxLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralSoftmaxLayer.html).\n",
    "\n",
    "We'll create a feed-forward, fully connected (every neuron is connected to all neurons in the layer below) neural network with 2 logistic hidden layers and a softmax output layer. The network will have 256 inputs, one for each pixel (16*16 image). The first hidden layer will have 256 neurons, the second will have 128 neurons, and the output layer will have 10 neurons, one for each digit class. Note that we're using a big network, compared with the size of the training set. This is to emphasize the effects of different regularization methods. We'll try training the network with:\n",
    "\n",
    "* No regularization\n",
    "* L2 regularization\n",
    "* L1 regularization\n",
    "* [Dropout](http://arxiv.org/abs/1207.0580) regularization\n",
    "\n",
    "Therefore, we'll create 4 versions of the network, train each one of them differently, and then compare the results on the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import NeuralNetwork, NeuralInputLayer, NeuralLogisticLayer, NeuralSoftmaxLayer\n",
    "from modshogun import DynamicObjectArray\n",
    "\n",
    "# setup the layers\n",
    "layers = DynamicObjectArray()\n",
    "layers.append_element(NeuralInputLayer(256)) # input layer, 256 neurons\n",
    "layers.append_element(NeuralLogisticLayer(256)) # first hidden layer, 256 neurons\n",
    "layers.append_element(NeuralLogisticLayer(128)) # second hidden layer, 128 neurons\n",
    "layers.append_element(NeuralSoftmaxLayer(10)) # output layer, 10 neurons\n",
    "\n",
    "# create the networks\n",
    "net_no_reg = NeuralNetwork(layers)\n",
    "net_no_reg.quick_connect()\n",
    "net_no_reg.initialize_neural_network()\n",
    "\n",
    "net_l2 = NeuralNetwork(layers)\n",
    "net_l2.quick_connect()\n",
    "net_l2.initialize_neural_network()\n",
    "\n",
    "net_l1 = NeuralNetwork(layers)\n",
    "net_l1.quick_connect()\n",
    "net_l1.initialize_neural_network()\n",
    "\n",
    "net_dropout = NeuralNetwork(layers)\n",
    "net_dropout.quick_connect()\n",
    "net_dropout.initialize_neural_network()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also visualize what the network would look like. To do that we'll draw a smaller network using [networkx](http://networkx.github.io/). The network we'll draw will have 8 inputs (labeled X), 8 neurons in the first hidden layer (labeled H), 4 neurons in the second hidden layer (labeled U), and 6 neurons in the output layer (labeled Y). Each neuron will be connected to all neurons in the layer that precedes it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# import networkx, install if necessary\n",
    "try:\n",
    "    import networkx as nx\n",
    "except ImportError:\n",
    "    import pip\n",
    "    pip.main(['install', '--user', 'networkx'])\n",
    "    import networkx as nx\n",
    "  \n",
    "G = nx.DiGraph()\n",
    "pos = {}\n",
    "\n",
    "for i in range(8):\n",
    "    pos['X'+str(i)] = (i,0) # 8 neurons in the input layer\n",
    "    pos['H'+str(i)] = (i,1) # 8 neurons in the first hidden layer\n",
    "    \n",
    "    for j in range(8): G.add_edge('X'+str(j),'H'+str(i))\n",
    "        \n",
    "    if i<4:\n",
    "        pos['U'+str(i)] = (i+2,2) # 4 neurons in the second hidden layer\n",
    "        for j in range(8): G.add_edge('H'+str(j),'U'+str(i))\n",
    "    \n",
    "    if i<6:\n",
    "        pos['Y'+str(i)] = (i+1,3) # 6 neurons in the output layer\n",
    "        for j in range(4): G.add_edge('U'+str(j),'Y'+str(i))\n",
    "\n",
    "nx.draw(G, pos, node_color='y', node_size=750)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[NeuralNetwork](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html) supports two methods for training: LBFGS (default) and mini-batch gradient descent.\n",
    "\n",
    "[LBFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is a full-batch optimization methods, it looks at the entire training set each time before it changes the network's parameters. This makes it slow with large datasets. However, it works very well with small/medium size datasets and is very easy to use as it requires no parameter tuning.\n",
    "\n",
    "[Mini-batch Gradient Descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent) looks at only a small portion of the training set (a mini-batch) before each step, which it makes it suitable for large datasets. However, it's a bit harder to use than LBFGS because it requires some tuning for its parameters (learning rate, learning rate decay,..)\n",
    "\n",
    "Training in [NeuralNetwork](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html) stops when:\n",
    "\n",
    "* Number of epochs (iterations over the entire training set) exceeds [max_num_epochs](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html#a7a2132cd0710750d28eaa4cd51d702af)\n",
    "* The (percentage) difference in error between the current and previous iterations is smaller than [epsilon](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html#a0bde8d297e19e73b20b99110ba38f7bd), i.e the error is not anymore being reduced by training\n",
    "\n",
    "To see all the options supported for training, check the [documentation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html#pub-attribs)\n",
    "\n",
    "We'll first write a small function to calculate the classification accuracy on the validation set, so that we can compare different models:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import MulticlassAccuracy\n",
    "\n",
    "def compute_accuracy(net, X, Y):\n",
    "    predictions = net.apply_multiclass(X)\n",
    "\n",
    "    evaluator = MulticlassAccuracy()\n",
    "    accuracy = evaluator.evaluate(predictions, Y)\n",
    "    return accuracy*100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Training without regularization**\n",
    "\n",
    "We'll start by training the first network without regularization using LBFGS optimization. Note that LBFGS is suitable because we're using a small dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "net_no_reg.set_epsilon(1e-6)\n",
    "net_no_reg.set_max_num_epochs(600)\n",
    "\n",
    "# uncomment this line to allow the training progress to be printed on the console\n",
    "#from modshogun import MSG_INFO; net_no_reg.io.set_loglevel(MSG_INFO)\n",
    "\n",
    "net_no_reg.set_labels(Ytrain)\n",
    "net_no_reg.train(Xtrain) # this might take a while, depending on your machine\n",
    "\n",
    "# compute accuracy on the validation set\n",
    "print \"Without regularization, accuracy on the validation set =\", compute_accuracy(net_no_reg, Xval, Yval), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Training with L2 regularization**\n",
    "\n",
    "We'll train another network, but with L2 regularization. This type of regularization attempts to prevent overfitting by penalizing large weights. This is done by adding $\\frac{1}{2} \\lambda \\Vert W \\Vert_2$ to the optimization objective that the network tries to minimize, where $\\lambda$ is the regularization coefficient.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# turn on L2 regularization\n",
    "net_l2.set_l2_coefficient(3e-4)\n",
    "\n",
    "net_l2.set_epsilon(1e-6)\n",
    "net_l2.set_max_num_epochs(600)\n",
    "\n",
    "net_l2.set_labels(Ytrain)\n",
    "net_l2.train(Xtrain) # this might take a while, depending on your machine\n",
    "\n",
    "# compute accuracy on the validation set\n",
    "print \"With L2 regularization, accuracy on the validation set =\", compute_accuracy(net_l2, Xval, Yval), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Training with L1 regularization**\n",
    "\n",
    "We'll now try L1 regularization. It works by by adding $\\lambda \\Vert W \\Vert_1$ to the optimzation objective. This has the effect of penalizing all non-zero weights, therefore pushing all the weights to be close to 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# turn on L1 regularization\n",
    "net_l1.set_l1_coefficient(3e-5)\n",
    "\n",
    "net_l1.set_epsilon(e-6)\n",
    "net_l1.set_max_num_epochs(600)\n",
    "\n",
    "net_l1.set_labels(Ytrain)\n",
    "net_l1.train(Xtrain) # this might take a while, depending on your machine\n",
    "\n",
    "# compute accuracy on the validation set\n",
    "print \"With L1 regularization, accuracy on the validation set =\", compute_accuracy(net_l1, Xval, Yval), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Training with dropout**\n",
    "\n",
    "The idea behind [dropout](http://arxiv.org/abs/1207.0580) is very simple: randomly ignore some neurons during each training iteration. When used on neurons in the hidden layers, it has the effect of forcing each neuron to learn to extract features that are useful in any context, regardless of what the other hidden neurons in its layer decide to do. Dropout can also be used on the inputs to the network by randomly omitting a small fraction of them during each iteration.\n",
    "\n",
    "When using dropout, it's usually useful to limit the L2 norm of a neuron's incoming weights vector to some constant value.\n",
    "\n",
    "Due to the stochastic nature of dropout, LBFGS optimization doesn't work well with it, therefore we'll use mini-batch gradient descent instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import NNOM_GRADIENT_DESCENT\n",
    "\n",
    "# set the dropout probabilty for neurons in the hidden layers\n",
    "net_dropout.set_dropout_hidden(0.5)\n",
    "# set the dropout probabilty for the inputs\n",
    "net_dropout.set_dropout_input(0.2)\n",
    "# limit the maximum incoming weights vector lengh for neurons\n",
    "net_dropout.set_max_norm(15)\n",
    "\n",
    "net_dropout.set_epsilon(1e-6)\n",
    "net_dropout.set_max_num_epochs(600)\n",
    "\n",
    "# use gradient descent for optimization\n",
    "net_dropout.set_optimization_method(NNOM_GRADIENT_DESCENT)\n",
    "net_dropout.set_gd_learning_rate(0.5)\n",
    "net_dropout.set_gd_mini_batch_size(100)\n",
    "\n",
    "net_dropout.set_labels(Ytrain)\n",
    "net_dropout.train(Xtrain) # this might take a while, depending on your machine\n",
    "\n",
    "# compute accuracy on the validation set\n",
    "print \"With dropout, accuracy on the validation set =\", compute_accuracy(net_dropout, Xval, Yval), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convolutional Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll look at a different type of network, namely [convolutional neural networks](http://deeplearning.net/tutorial/lenet.html). A convolutional net operates on two principles:\n",
    "\n",
    "- **Local connectivity**: Convolutional nets work with inputs that have some sort of spacial structure, where the order of the inputs features matter, i.e images. Local connectivity means that each neuron will be connected only to a small neighbourhood of pixels.\n",
    "- **Weight sharing**: Different neurons use the same set of weights. This greatly reduces the number of free parameters, and therefore makes the optimization process easier and acts as a good regularizer. \n",
    "\n",
    "With that in mind, each layer in a convolutional network consists of a number of feature maps. Each feature map is produced by convolving a small filter with the layer's inputs, adding a bias, and then applying some non-linear activation function. The convolution operation satisfies the local connectivity and the weight sharing constraints. Additionally, a max-pooling operation can be performed on each feature map by dividing it into small non-overlapping regions and taking the maximum over each region. This adds some translation invarience and improves the performance.\n",
    "\n",
    "Convolutional nets in Shogun are handled through the [CNeuralNetwork](http://www.shogun-toolbox.org/doc/en/latest/classes.html) class along with the [CNeuralConvolutionalLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralConvolutionalLayer.html) class. A [CNeuralConvolutionalLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralConvolutionalLayer.html) represents a convolutional layer with multiple feature maps, optional max-pooling, and support for [different types of activation functions](http://www.shogun-toolbox.org/doc/en/latest/namespaceshogun.html#a2b9827281875ee8de764ea86e7735482)\n",
    "\n",
    "Now we'll creates a convolutional neural network with two convolutional layers and a softmax output layer. We'll use the [rectified linear](http://en.wikipedia.org/wiki/Rectifier_(neural_networks)) activation function for the convolutional layers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import NeuralConvolutionalLayer, CMAF_RECTIFIED_LINEAR\n",
    "\n",
    "# prepere the layers\n",
    "layers_conv = DynamicObjectArray()\n",
    "\n",
    "# input layer, a 16x16 image single channel image\n",
    "layers_conv.append_element(NeuralInputLayer(16,16,1)) \n",
    "\n",
    "# the first convolutional layer: 10 feature maps, filters with radius 2 (5x5 filters)\n",
    "# and max-pooling in a 2x2 region: its output will be 10 8x8 feature maps\n",
    "layers_conv.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 10, 2, 2, 2, 2))\n",
    "\n",
    "# the first convolutional layer: 15 feature maps, filters with radius 2 (5x5 filters)\n",
    "# and max-pooling in a 2x2 region: its output will be 15 4x4 feature maps\n",
    "layers_conv.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 15, 2, 2, 2, 2))\n",
    "\n",
    "# output layer\n",
    "layers_conv.append_element(NeuralSoftmaxLayer(10))\n",
    "\n",
    "# create and initialize the network\n",
    "net_conv = NeuralNetwork(layers_conv)\n",
    "net_conv.quick_connect()\n",
    "net_conv.initialize_neural_network()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can train the network. Like in the previous section, we'll use gradient descent with dropout and max-norm regularization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 50% dropout in the input layer\n",
    "net_conv.set_dropout_input(0.5)\n",
    "\n",
    "# max-norm regularization\n",
    "net_conv.set_max_norm(1.0)\n",
    "\n",
    "# set gradient descent parameters\n",
    "net_conv.set_optimization_method(NNOM_GRADIENT_DESCENT)\n",
    "net_conv.set_gd_learning_rate(0.01)\n",
    "net_conv.set_gd_mini_batch_size(100)\n",
    "net_conv.set_epsilon(0.0)\n",
    "net_conv.set_max_num_epochs(100)\n",
    "\n",
    "# start training\n",
    "net_conv.set_labels(Ytrain)\n",
    "net_conv.train(Xtrain)\n",
    "\n",
    "# compute accuracy on the validation set\n",
    "print \"With a convolutional network, accuracy on the validation set =\", compute_accuracy(net_conv, Xval, Yval), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According the accuracy on the validation set, the convolutional network works best in out case. Now we'll measure its performance on the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"Accuracy on the test set using the convolutional network =\", compute_accuracy(net_conv, Xtest, Ytest), \"%\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also look at some of the images and the network's response to each of them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "predictions = net_conv.apply_multiclass(Xtest)\n",
    "\n",
    "_=figure(figsize=(10,12))\n",
    "# plot some images, with the predicted label as the title of each image\n",
    "# this code is borrowed from the KNN notebook by Chiyuan Zhang and Sören Sonnenburg \n",
    "for i in range(100):\n",
    "    ax=subplot(10,10,i+1)\n",
    "    title(int(predictions[i]))\n",
    "    ax.imshow(Xtest[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)\n",
    "    ax.set_xticks([])\n",
    "    ax.set_yticks([])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
